Content-Based Weak Supervision for Ad-Hoc Re-Ranking
One challenge with neural ranking is the need for a large number of
manually labeled relevance judgments for training. In contrast with prior work,
we examine the use of weak supervision sources for training that yield pseudo
query-document pairs that already exhibit relevance (e.g., newswire
headline-content pairs and encyclopedic heading-paragraph pairs). We also
propose two techniques for filtering out training samples that are too far out
of domain: a heuristic-based approach and a novel supervised filter that
re-purposes a neural ranker. Using several leading neural ranking
architectures and multiple weak supervision datasets, we show that these
sources of training pairs are effective on their own (outperforming prior weak
supervision techniques), and that filtering can further improve performance.Comment: SIGIR 2019 (short paper
Pseudo Label Is Better Than Human Label
State-of-the-art automatic speech recognition (ASR) systems are trained with
tens of thousands of hours of labeled speech data. Human transcription is
expensive and time-consuming. Factors such as the quality and consistency of
the transcription can greatly affect the performance of the ASR models trained
with these data. In this paper, we show that we can train a strong teacher
model to produce high-quality pseudo labels by utilizing recent self-supervised
and semi-supervised learning techniques. Specifically, we use JUST (Joint
Unsupervised/Supervised Training) and iterative noisy student teacher training
to train a 600 million parameter bi-directional teacher model. This model
achieved a 4.0% word error rate (WER) on a voice search task, an 11.1% relative
improvement over a baseline. We further show that by using this strong teacher model
to generate high-quality pseudo labels for training, we can achieve a 13.6%
relative WER reduction (5.9% to 5.1%) for a streaming model compared to using
human labels.
Comment: 6 pages, 2 figures, 9 tables, submitted to INTERSPEECH
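To make the training recipe concrete, here is a minimal sketch of one round of noisy student training: a trained teacher transcribes unlabeled audio, and a student learns from the noised, pseudo-labeled pairs. The APIs shown (teacher.transcribe, train_step, augment) are hypothetical placeholders, not the paper's implementation.

```python
# Sketch of one round of noisy student training with hypothetical APIs.
def noisy_student_round(teacher, student, unlabeled_audio, train_step, augment):
    # Teacher produces pseudo labels for unlabeled audio.
    pseudo_labeled = [(augment(audio), teacher.transcribe(audio))
                      for audio in unlabeled_audio]
    # Student trains on the noised inputs paired with pseudo labels.
    for audio, transcript in pseudo_labeled:
        train_step(student, audio, transcript)
    return student  # promoted to teacher in the next iteration
```

Iterating this loop, with the student promoted to teacher each round, is the essence of the iterative noisy student procedure the abstract describes.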
Fix it where it fails: Pronunciation learning by mining error corrections from speech logs
The pronunciation dictionary, or lexicon, is an essential component in an automatic speech recognition (ASR) system in that incorrect pronunciations cause systematic misrecognitions. It typically consists of a list of word-pronunciation pairs written by linguists, and a grapheme-to-phoneme (G2P) engine to generate pronunciations for words not in the list. The hand-generated list can never keep pace with the growing vocabulary of a live speech recognition system, and the G2P is usually of limited accuracy. This is especially true for proper names whose pronunciations may be influenced by various historical or foreign-origin factors. In this paper, we propose a language-independent approach to detect misrecognitions and their corrections from voice search logs. We learn previously unknown pronunciations from this data, and demonstrate that they significantly improve the quality of a production-quality speech recognition system.
Index Terms: speech recognition, pronunciation learning, data extraction, logistic regression
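The index terms mention logistic regression, which suggests a classifier over mined candidate pairs. The sketch below scores hypothetical (misrecognition, correction) candidates drawn from consecutive voice search queries; the features and data are invented for illustration and are not the paper's feature set.

```python
# Illustrative sketch: scoring candidate (misrecognition, correction) pairs
# mined from voice search logs with logistic regression. Features are
# assumptions: phonetic similarity of the two transcripts, time gap (seconds)
# between the queries, and whether the second query led to a click.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.9,   3.0, 1.0],   # similar, quick retry, success -> correction
              [0.2, 500.0, 0.0]])  # dissimilar, long gap -> unrelated queries
y = np.array([1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.85, 5.0, 1.0]])[0, 1])  # P(pair is a correction)
```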
Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion
Self-supervised pre-training of a speech foundation model, followed by
supervised fine-tuning, has shown impressive quality improvements on automatic
speech recognition (ASR) tasks. Fine-tuning a separate foundation model for each
downstream task is expensive, since the foundation model is usually very large.
Parameter-efficient fine-tuning methods (e.g., adapters and sparse update methods)
offer an alternative paradigm in which a small set of parameters is updated to
adapt the foundation model to new tasks. However, these methods still suffer
from a high computational memory cost and slow training speed because they
require backpropagation through the entire neural network at each step. In this
paper, we analyze the performance of features from different layers of a
foundation model on the speech recognition task and propose a novel
hierarchical feature fusion method for resource-efficient transfer learning
from speech foundation models. Experimental results show that the proposed
method achieves better performance on the speech recognition task than existing
algorithms, with fewer trainable parameters, lower computational memory cost,
and faster training. When combined with Adapters at all layers, the proposed
method matches the performance of fine-tuning the whole model while using
fewer trainable encoder parameters and training faster.
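To make the fusion idea concrete, here is a hedged sketch of combining frozen intermediate-layer features with learned scalar weights, so that gradients reach only the fusion weights and a small head rather than the whole encoder. The generic weighted-sum fusion, class names, and shapes are assumptions; the paper's hierarchical scheme may differ.

```python
# Sketch: learned weighted fusion of frozen encoder-layer features.
# Only the fusion weights and the small head receive gradients.
import torch
import torch.nn as nn

class LayerFusionHead(nn.Module):
    def __init__(self, num_layers: int, dim: int, num_classes: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, time, dim), precomputed from a
        # frozen encoder, so no backprop runs through the foundation model.
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w[:, None, None, None] * layer_feats).sum(dim=0)
        return self.head(fused)

# Usage with hypothetical shapes: 12 encoder layers, 256-dim features.
feats = torch.randn(12, 4, 100, 256)  # frozen, detached encoder outputs
logits = LayerFusionHead(12, 256, 32)(feats)  # (4, 100, 32)
```

Because the encoder outputs can be precomputed and detached, each training step touches only the fusion weights and the head, which is what yields the memory and speed savings the abstract claims.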